Determining a threshold for binary dependent variables

When predicting a binary dependent variable, the output of your model is usually a probability, or can easily be converted to one. Often, it is desirable to convert this probability into a binary prediction that matches the dependent variable.

For example, if you are predicting whether a customer will buy a product, you may want to convert the probability to buy into a binary prediction of buy/not buy for each consumer in your scoring data set. The default for most algorithms is 0.50. That is, if a consumer has a probability greater than 50% they are predicted as a buyer. If a consumer has a probability less than 50% they are predicted as a non-buyer.
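As a quick sketch of the default rule (using hypothetical probabilities, not the notebook's data), the conversion is just a comparison against 0.50:

```python
import numpy as np

# Hypothetical predicted probabilities to buy for four consumers.
prob_to_buy = np.array([0.12, 0.55, 0.49, 0.91])

# Default rule: probability above 0.50 means predicted buyer (1), else non-buyer (0).
predicted_buyer = np.where(prob_to_buy > 0.50, 1, 0)
print(predicted_buyer)  # [0 1 0 1]
```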

But is .50 the right cut-off? Just because it is the default does not mean it is necessarily the best. In this notebook, we examine a method to determine the best cut-off based on the context of your business problem.

Note that if you are using these methods on a real-world problem, make sure that you determine the best cut-off on your TRAINING data set and confirm the results on your TESTING and VALIDATION data sets.

Before we get too far, I just want to point out that there are many different ways to address this problem. I use this method because it works for me, not because it is superior to any other method.

Table of contents

1. Getting Setup
2. Data Exploration
3. Introducing a Gains Table
4. Finding the best cut-off
5. Summary and conclusions

1. Getting Setup

1.1 Install all of the relevant Python Libraries

In [ ]:
#!pip install --upgrade numpy 
#!pip install plotly --upgrade
!pip install chart-studio --upgrade

1.2 Import relevant libraries

In [3]:
import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly as plotly
import pandas as pd
import numpy as np
from sklearn import metrics

# Only needed if you pull the data from IBM Cloud Object Storage instead of GitHub.
from botocore.client import Config
import ibm_boto3

#Un-comment these options if you want to expand the number of rows and columns you see in the notebook.
#pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)

1.3 Pull Data from GitHub

In [ ]:
!rm -f df_for_export-2.csv
!wget https://raw.githubusercontent.com/shadgriffin/best_cut_off/master/df_for_export-2.csv
In [5]:
pd_data = pd.read_csv("df_for_export-2.csv", sep=",", header=0)

2. Data Exploration

This is a simple sample data set which represents equipment failure for an oil and gas company.

  • ID: unique identifier for a piece of equipment
  • DATE: date of the observation
  • FAILURE_TARGET: the original dependent variable. It is binary where a 1 indicates that the equipment failed on a particular day and a 0 means that the equipment did not fail on a particular day
  • P_FAIL: the probability that the equipment will fail on a particular day. It comes from a previously built predictive model.
In [6]:
pd_data.head()
Out[6]:
ID DATE FAILURE_TARGET P_FAIL
0 100003 4/5/16 0 0.177485
1 100003 3/30/16 0 0.208043
2 100003 3/31/16 0 0.194013
3 100003 4/1/16 0 0.158498
4 100003 4/2/16 0 0.190317

For example, in the first record above, for ID 100003 on 04/05/2016 the probability to fail was .177485 and the equipment did not fail.

The objective is to find the probability cut-off (P_FAIL) that best represents the actual target (dependent variable).

3. Introducing a Gains Table

A Gains table is a useful tool for evaluating the accuracy of a model where the predicted variable is binary. A Gains table is easy to explain and extremely effective in determining the fitness of a machine learning model.

3.1 Assign Deciles based on the probability

In [7]:
dfx=pd_data
In [8]:
#Sort the data by P_FAIL in descending order

dfx=dfx.sort_values(by=["P_FAIL"], ascending=[False])
In [9]:
#add a very small random number to the probability to break ties

dfx['wookie'] = (np.random.randint(0, 100, dfx.shape[0]))/100000000000000000

dfx['P_FAIL']=dfx['P_FAIL']+dfx['wookie']
In [10]:
#Create deciles based on P_FAIL
dfx['DECILE'] = pd.qcut(dfx['P_FAIL'], 10, labels=np.arange(100, 0, -10))

3.2 Assemble the Gains Table

In [11]:
# Find the minimum probability for each decile
tips_summedv = pd.DataFrame(dfx.groupby(['DECILE'])['P_FAIL'].min())
# Find the maximum probability of each decile
tips_summedw = pd.DataFrame(dfx.groupby(['DECILE'])['P_FAIL'].max())
# Find the Actual Failure rate for each decile.
tips_summedx = pd.DataFrame(dfx.groupby(['DECILE'])['FAILURE_TARGET'].mean())
#Sum the number of Failures in each decile.
tips_summedy = pd.DataFrame(dfx.groupby(['DECILE'])['FAILURE_TARGET'].sum())
# count the records in each decile
tips_summedz = pd.DataFrame(dfx.groupby(['DECILE'])['FAILURE_TARGET'].count())

#Aggregate the summaries into one dataframe
tips = pd.concat([tips_summedv,tips_summedw, tips_summedx, tips_summedy,tips_summedz], axis=1)
tips.columns = ['MIN_SCORE','MAX_SCORE','FAILURE_RATE','FAILURES', 'OBS']

tips=tips.sort_values(by=['DECILE'], ascending=[False])
gains=tips
#Find the number of cumulative failures by decile.
gains['CUML_FAILURES']=gains['FAILURES'].cumsum()
#Find the percentage of failures in each decile
gains['PCT_OF_FAILURES']=(gains.FAILURES)/(dfx['FAILURE_TARGET'].sum())*100
#Find the cumulative percentage of failures in each decile.
gains['CUML_PCT_OF_FAILURES']=gains.PCT_OF_FAILURES.cumsum()
#Format the final output
gains=gains[['OBS','MIN_SCORE','MAX_SCORE','FAILURES','FAILURE_RATE','PCT_OF_FAILURES','CUML_FAILURES','CUML_PCT_OF_FAILURES']]

gains
Out[11]:
OBS MIN_SCORE MAX_SCORE FAILURES FAILURE_RATE PCT_OF_FAILURES CUML_FAILURES CUML_PCT_OF_FAILURES
DECILE
10 16521 0.634321 0.975795 3265 0.197627 69.959289 3265 69.959289
20 16520 0.455667 0.634312 818 0.049516 17.527319 4083 87.486608
30 16521 0.335289 0.455661 354 0.021427 7.585172 4437 95.071781
40 16520 0.252277 0.335284 143 0.008656 3.064067 4580 98.135847
50 16521 0.191109 0.252276 62 0.003753 1.328477 4642 99.464324
60 16520 0.146307 0.191104 20 0.001211 0.428541 4662 99.892865
70 16521 0.111849 0.146306 3 0.000182 0.064281 4665 99.957146
80 16520 0.080237 0.111848 2 0.000121 0.042854 4667 100.000000
90 16521 0.041927 0.080236 0 0.000000 0.000000 4667 100.000000
100 16521 0.001404 0.041925 0 0.000000 0.000000 4667 100.000000

The gains table above shows the relationship between the predicted value (P_FAIL) and the dependent variable (FAILURE_TARGET).

Interpreting the gains table is straightforward. If the deciles were assigned randomly, you would expect 10% of the failures to fall in each decile. When you use P_FAIL (the predicted probability) to create the deciles, about 70% of the failures fall in the top decile. Put another way, the failure rate in the top decile is almost 20% (1 in 5), while the failure rate in the bottom decile is 0%. All of this means that P_FAIL does a great job of predicting FAILURE_TARGET.

Next, we use the concept of a gains table to find the best cut-off.

4. Finding the best cut-off

First, let's define a few things. We just showed how a gains table is used to evaluate the effectiveness of a model. A confusion matrix is another way to gauge the effectiveness of a model.

A confusion matrix uses a cut-off value and then assigns each prediction into a binary yes/no format consistent with your business problem. In this case, we would need to classify each observation as either a predicted failure (1) or a predicted non-failure (0). Once we make this classification, a confusion matrix is simply the cross-tabulation of the column representing the predicted failure/non-failure and the column representing the actual failure/non-failure.

A confusion matrix communicates four different possible outcomes. Again, a cut-off (threshold) is required to build a confusion matrix.

  1. The model predicts a failure and the equipment did fail. This is a True Positive.
  2. The model predicts a failure and the equipment did not fail. This is a False Positive.
  3. The model predicts a non-failure and the equipment does not fail. This is a True Negative.
  4. The model predicts a non-failure and the equipment does fail. This is a False Negative.
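These four outcomes can be tallied directly. A minimal sketch, using hypothetical actual and predicted labels rather than the notebook's data:

```python
# Hypothetical binary labels: 1 = failure, 0 = non-failure.
actual    = [1, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 0, 1, 1]

# Count each of the four confusion-matrix cells.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

print(tp, fp, tn, fn)  # 2 1 2 1
```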

In finding the best cut-off, our goal is to minimize the cost of misclassified predictions and maximize the benefit of correctly classified predictions.

In other words, we want to select the cut-off that minimizes the impact of False Positives and False Negatives, while maximizing the impact of True Positives and True Negatives.

Here are few other pertinent definitions:

  • Sensitivity (True Positive Rate) = Of all actual Positives, the percentage correctly classified as positive. (True Positives/(True Positives + False Negatives)).
  • Specificity (True Negative Rate) = Of all actual Negatives, the percentage correctly classified as negative. (True Negatives /(False Positives + True Negatives)).
  • False Positive Rate = Of all actual Negatives, the percentage incorrectly classified as positive. (1-Specificity)
  • False Negative Rate = Of all actual Positives, the percentage incorrectly classified as negative. (1-Sensitivity)

In our data set:

  • False Positive Rate = Of all actual non-failures, the percentage incorrectly classified as failures.
  • False Negative Rate = Of all actual failures, the percentage incorrectly classified as non-failures.

Again, our objective in determining a cut-off is to minimize the cost of False Positives and False Negatives.
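These formulas can be checked with the confusion-matrix counts the notebook produces later for the .909427 cut-off (685 true positives, 356 false positives, 160183 true negatives, 3982 false negatives):

```python
# Confusion-matrix counts taken from the notebook's table for cut-off .909427.
tp, fp, tn, fn = 685, 356, 160183, 3982

sensitivity = tp / (tp + fn)          # true positive rate
specificity = tn / (tn + fp)          # true negative rate
false_positive_rate = 1 - specificity
false_negative_rate = 1 - sensitivity

print(round(sensitivity, 6), round(specificity, 6))  # 0.146775 0.997782
```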

4.1 Morphing a Gains Table into a continuum of cut-offs

In [12]:
dfx=pd_data

Add a small random number to P_FAIL to break ties when grouping.

In [13]:
dfx['wookie'] = (np.random.randint(0, 100, dfx.shape[0]))/100000000000000000

dfx['P_FAIL']=dfx['P_FAIL']+dfx['wookie']

Instead of deciles, we create 10,000 groups based on the probability to fail.

In [14]:
dfx['GROUPS'] = pd.qcut(dfx['P_FAIL'], 10000, labels=False)
In [15]:
# find the minimum P_FAIL for each group.  This is a potential cut-off point.
tips_summedb = pd.DataFrame(dfx.groupby(['GROUPS'])['P_FAIL'].min())

#Find the number of Failures in each group
tips_summedz = pd.DataFrame(dfx.groupby(['GROUPS'])['FAILURE_TARGET'].sum())
#find the number of observations in each group
tips_summeda = pd.DataFrame(dfx.groupby(['GROUPS'])['FAILURE_TARGET'].count())

#append the summaries into one dataframe
tips = pd.concat([tips_summedb,tips_summedz, tips_summeda], axis=1)
tips.columns = ['CUT-OFF','FAILURES', 'OBS']

#find the number of non-failures
tips['NON_FAILURES']=tips.OBS-tips.FAILURES

#reset the index to make GROUPS a column
tips.reset_index(level=0, inplace=True)

#sort the dataframe by groups in descending order
tips=tips.sort_values(by=['GROUPS'], ascending=[False])


# Cumulative sum the failures and non-failures in descending order; total the observations
tips['INV_CUM_FAILURES'] = tips.FAILURES.cumsum()
tips['INV_CUM_NON_FAILURES'] = tips.NON_FAILURES.cumsum()
tips['TOTAL_OBS']=tips.OBS.sum()


#Sort the data by Groups ascending
tips=tips.sort_values(by=['GROUPS'], ascending=[True])


#cumulative sum the failures and non-failures in ascending order
tips['CUM_FAILURES'] = tips.FAILURES.cumsum()
tips['CUM_NON_FAILURES'] = tips.NON_FAILURES.cumsum()


#find the total number of failures for the whole dataset.
tips['TOTAL_FAILURES']=tips.FAILURES.sum()
tips['TOTAL_NON_FAILURES']=tips.NON_FAILURES.sum()
#define the true positives for each cut-off
tips['TRUE_POSITIVES']=tips.INV_CUM_FAILURES
#define the false positives for each cut-off
tips['FALSE_POSITIVES']=tips.INV_CUM_NON_FAILURES
#define the true negatives for each cut-off
tips['TRUE_NEGATIVES']=tips.CUM_NON_FAILURES-tips.NON_FAILURES
#define the false negatives for each cut-off
tips['FALSE_NEGATIVES']=tips.CUM_FAILURES-tips.FAILURES
#double check the logic and arithmetic.
tips['OBS2']=tips.TRUE_POSITIVES+tips.FALSE_POSITIVES+tips.TRUE_NEGATIVES+tips.FALSE_NEGATIVES

# define the sensitvity for each cut-off
tips['SENSITIVITY']=tips['TRUE_POSITIVES']/(tips['TRUE_POSITIVES']+tips['FALSE_NEGATIVES'])
#define the specificity for each cut-off
tips['SPECIFICITY']=tips['TRUE_NEGATIVES']/(tips['FALSE_POSITIVES']+tips['TRUE_NEGATIVES'])
#define the false positive rate for each cut-off
tips['FALSE_POSITIVE_RATE']=1-tips['SPECIFICITY']
#define the false negative rate for each cut-off
tips['FALSE_NEGATIVE_RATE']=1-tips['SENSITIVITY']

tipsx=tips

So, in the table below, the cut-off is in the second column. For example, if you use a cut-off of .003173, all probabilities greater than or equal to .003173 are labeled as predicted failures, and all probabilities below .003173 are labeled as predicted non-failures. Using .003173 as a cut-off means that you will have:

  • 4667 True Positives
  • 160505 False Positives
  • 34 True Negatives
  • 0 False Negatives
In [16]:
gains=tipsx[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
            'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE']]

gains
Out[16]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE
0 0 0.001404 4667 160539 0 0 1.000000 0.000000 1.000000 0.000000
1 1 0.002718 4667 160522 17 0 1.000000 0.000106 0.999894 0.000000
2 2 0.003173 4667 160505 34 0 1.000000 0.000212 0.999788 0.000000
3 3 0.003342 4667 160489 50 0 1.000000 0.000311 0.999689 0.000000
4 4 0.003520 4667 160472 67 0 1.000000 0.000417 0.999583 0.000000
... ... ... ... ... ... ... ... ... ... ...
9995 9995 0.958857 71 12 160527 4596 0.015213 0.999925 0.000075 0.984787
9996 9996 0.959056 53 12 160527 4614 0.011356 0.999925 0.000075 0.988644
9997 9997 0.961913 43 7 160532 4624 0.009214 0.999956 0.000044 0.990786
9998 9998 0.963536 32 1 160538 4635 0.006857 0.999994 0.000006 0.993143
9999 9999 0.966488 17 0 160539 4650 0.003643 1.000000 0.000000 0.996357

10000 rows × 10 columns

4.2 Finding the cut-off with the smallest misclassification rate

In the previous step we created 10,000 potential cut-offs. Now we can determine the cut-off that minimizes the misclassification rate. The first step in this process is to calculate the misclassification rate for each cut-off.

In [17]:
tips=tipsx

#sum the false positives and false negatives.
tips['FALSE_CLASSIFICATIONS'] = tips.FALSE_POSITIVES+tips.FALSE_NEGATIVES

#estimate the false classification rate
tips['FALSE_CLASSIFICATION_RATE']=tips.FALSE_CLASSIFICATIONS/(tips.TOTAL_OBS)



gains=tips[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
            'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE','FALSE_CLASSIFICATIONS','FALSE_CLASSIFICATION_RATE']]

In this first example, we calculate the simple, unweighted misclassification rate for each cut-off and then determine which cut-off has the smallest rate. Note that this assumes a false positive has the same cost as a false negative. Often, this is not the case.

In [18]:
gains
Out[18]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE FALSE_CLASSIFICATIONS FALSE_CLASSIFICATION_RATE
0 0 0.001404 4667 160539 0 0 1.000000 0.000000 1.000000 0.000000 160539 0.971750
1 1 0.002718 4667 160522 17 0 1.000000 0.000106 0.999894 0.000000 160522 0.971648
2 2 0.003173 4667 160505 34 0 1.000000 0.000212 0.999788 0.000000 160505 0.971545
3 3 0.003342 4667 160489 50 0 1.000000 0.000311 0.999689 0.000000 160489 0.971448
4 4 0.003520 4667 160472 67 0 1.000000 0.000417 0.999583 0.000000 160472 0.971345
... ... ... ... ... ... ... ... ... ... ... ... ...
9995 9995 0.958857 71 12 160527 4596 0.015213 0.999925 0.000075 0.984787 4608 0.027892
9996 9996 0.959056 53 12 160527 4614 0.011356 0.999925 0.000075 0.988644 4626 0.028001
9997 9997 0.961913 43 7 160532 4624 0.009214 0.999956 0.000044 0.990786 4631 0.028032
9998 9998 0.963536 32 1 160538 4635 0.006857 0.999994 0.000006 0.993143 4636 0.028062
9999 9999 0.966488 17 0 160539 4650 0.003643 1.000000 0.000000 0.996357 4650 0.028147

10000 rows × 12 columns

That's a lot of data. Let's examine the data visually.

In [19]:
x1 = tips['CUT-OFF']
y1 = tips['FALSE_POSITIVE_RATE']
y2 = tips['FALSE_NEGATIVE_RATE']
y3 = tips['FALSE_CLASSIFICATION_RATE']



trace = go.Scatter(
    x = x1,
    y = y1,
    name='False Positive Rate')

trace2 = go.Scatter(
    x = x1,
    y = y2,
    name='False Negative Rate'
)

trace3 = go.Scatter(
    x = x1,
    y = y3,
    name='False Classification Rate'
)

layout = go.Layout(
    title='Mis-Classification Rates BY CUT OFF SCORE',
    xaxis=dict(
        title='CUT OFF SCORE',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='False Positive and False Negative Rates',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)
    
data=[trace,trace2,trace3]  
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines') 

The chart above suggests that the best cut-off score is around 0.90.

Note that increasing the cut-off leads to more False Negatives and decreasing the cut-off leads to more False Positives.

In [20]:
gains=gains.sort_values(by=['FALSE_CLASSIFICATION_RATE'], ascending=[True])
gains=gains.head(1)
gains
Out[20]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE FALSE_CLASSIFICATIONS FALSE_CLASSIFICATION_RATE
9937 9937 0.909427 685 356 160183 3982 0.146775 0.997782 0.002218 0.853225 4338 0.026258

By querying the data, we can see the best cut-off is 0.909427.

We can now use this cut-off to build a confusion matrix.

In [21]:
dfx=pd_data
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .909427)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
FAILURE_TARGET       0     1
Y_FAIL                      
0               160183  3982
1                  356   685
Out[21]:
FAILURE_TARGET 0 1
Y_FAIL
0 0.975744 0.024256
1 0.341979 0.658021

Based on this cut-off, we have 356 false positives and 3,982 false negatives.

Now, let's compare these results with a cut-off of .50.

In [22]:
dfx=pd_data
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .5)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
FAILURE_TARGET       0     1
Y_FAIL                      
0               136065   794
1                24474  3873
Out[22]:
FAILURE_TARGET 0 1
Y_FAIL
0 0.994198 0.005802
1 0.863372 0.136628

Using a cut-off of .5 means that we would have significantly more false positives and substantially fewer false negatives.

Note that you can use the table we created above to examine the cut-off, or threshold, in the context of an ROC curve.
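For intuition, here is a small sketch (hypothetical scores, not the notebook's data) of how sweeping candidate thresholds from high to low produces the (false positive rate, true positive rate) pairs that trace out an ROC curve. This is the same information the 10,000-row table above holds at scale; sklearn's `metrics.roc_curve`, imported in section 1.2, returns the equivalent `fpr`, `tpr`, and `thresholds` arrays directly.

```python
import numpy as np

# Hypothetical actual labels and predicted probabilities.
y_true  = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])

points = []
for t in sorted(set(y_score.tolist()), reverse=True):
    y_pred = (y_score >= t).astype(int)           # classify at this threshold
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    tpr = tp / int((y_true == 1).sum())           # sensitivity
    fpr = fp / int((y_true == 0).sum())           # 1 - specificity
    points.append((t, round(fpr, 2), round(tpr, 2)))

print(points)
```

Plotting `fpr` on the x-axis against `tpr` on the y-axis gives the ROC curve, exactly as the cell below does with the notebook's 10,000 cut-offs.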

In [23]:
x2 = tips['CUT-OFF']
x1 = tips['FALSE_POSITIVE_RATE']
y1 = tips['SENSITIVITY']




trace = go.Scatter(
    x = x1,
    y = y1,
    name='ROC')

trace2 = go.Scatter(
    x = x2,
    y = y1,
    name='Cut-Off v TPR'
)


layout = go.Layout(
    title='ROC with Threshold Levels',
    xaxis=dict(
        title='False Positive Rate and Threshold Level',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='True Positive Rate',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)
    
data=[trace,trace2]  
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines') 

4.3 Finding the cut-off when the costs of false positives and false negatives are not equal

In the previous example, we assumed the costs of false positives and false negatives were equal. In the real world, this is rarely the case. Take airplane engines, for example. If you have a false positive, you replace equipment that doesn't need to be replaced. This is not ideal, but compare it to the cost of a false negative. A false negative occurs when your model predicts that an airplane engine will not fail and it does. A false negative could literally mean that a plane falls from the sky and all passengers onboard plunge to their deaths.

In the next scenario, we'll assume that the exact costs are not known, but we have a rough idea that a false negative is about twice as costly as a false positive.

In [24]:
tips=tipsx

#define the cost of a false positive and false negative
cost_of_a_false_positive=1
cost_of_a_false_negative=2

#convert the costs into weights that sum to one
fp_weight=(cost_of_a_false_positive)/(cost_of_a_false_positive+cost_of_a_false_negative)
fn_weight=(cost_of_a_false_negative)/(cost_of_a_false_positive+cost_of_a_false_negative)
In [25]:
#Create a weighted false classification rate based on the costs of a false positive and a false negative.

tips['FALSE_CLASSIFICATIONS_W'] = np.where(((2*(((fp_weight)*tips.FALSE_POSITIVES+(fn_weight)*tips.FALSE_NEGATIVES)) >= tips.TOTAL_OBS)), tips.TOTAL_OBS, 2*(((fp_weight)*tips.FALSE_POSITIVES+(fn_weight)*tips.FALSE_NEGATIVES)))

tips['FALSE_CLASSIFICATION_RATE_W']=tips.FALSE_CLASSIFICATIONS_W/(tips.TOTAL_OBS)



gains=tips[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
            'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE','FALSE_CLASSIFICATIONS_W','FALSE_CLASSIFICATION_RATE_W']]

gains
Out[25]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE FALSE_CLASSIFICATIONS_W FALSE_CLASSIFICATION_RATE_W
0 0 0.001404 4667 160539 0 0 1.000000 0.000000 1.000000 0.000000 107026.000000 0.647834
1 1 0.002718 4667 160522 17 0 1.000000 0.000106 0.999894 0.000000 107014.666667 0.647765
2 2 0.003173 4667 160505 34 0 1.000000 0.000212 0.999788 0.000000 107003.333333 0.647696
3 3 0.003342 4667 160489 50 0 1.000000 0.000311 0.999689 0.000000 106992.666667 0.647632
4 4 0.003520 4667 160472 67 0 1.000000 0.000417 0.999583 0.000000 106981.333333 0.647563
... ... ... ... ... ... ... ... ... ... ... ... ...
9995 9995 0.958857 71 12 160527 4596 0.015213 0.999925 0.000075 0.984787 6136.000000 0.037142
9996 9996 0.959056 53 12 160527 4614 0.011356 0.999925 0.000075 0.988644 6160.000000 0.037287
9997 9997 0.961913 43 7 160532 4624 0.009214 0.999956 0.000044 0.990786 6170.000000 0.037347
9998 9998 0.963536 32 1 160538 4635 0.006857 0.999994 0.000006 0.993143 6180.666667 0.037412
9999 9999 0.966488 17 0 160539 4650 0.003643 1.000000 0.000000 0.996357 6200.000000 0.037529

10000 rows × 12 columns

There is a lot of data here. Let's examine the data graphically.

In [26]:
x1 = tips['CUT-OFF']
y1 = tips['FALSE_POSITIVE_RATE']
y2 = tips['FALSE_NEGATIVE_RATE']
y3 = tips['FALSE_CLASSIFICATION_RATE_W']



trace = go.Scatter(
    x = x1,
    y = y1,
    name='False Positive Rate')

trace2 = go.Scatter(
    x = x1,
    y = y2,
    name='False Negative Rate'
)

trace3 = go.Scatter(
    x = x1,
    y = y3,
    name='Weighted False Classification Rate'
)

layout = go.Layout(
    title='Weighted Mis-Classification Rates BY CUT OFF SCORE',
    xaxis=dict(
        title='CUT OFF SCORE',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='Weighted False Positive and False Negative Rates',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)
    
data=[trace,trace2,trace3]  
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines') 

Based on the chart above, we can see that the best cut-off score is around .88.

In [27]:
gains=gains.sort_values(by=['FALSE_CLASSIFICATION_RATE_W'], ascending=[True])
In [28]:
gains=gains.head(1)
In [29]:
gains
Out[29]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE FALSE_CLASSIFICATIONS_W FALSE_CLASSIFICATION_RATE_W
9871 9871 0.8802 1136 996 159543 3531 0.243411 0.993796 0.006204 0.756589 5372.0 0.032517

By querying the data, we can see the best cut-off is precisely .8802.

In [30]:
dfx=pd_data
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .8802)), 0, 1)
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
FAILURE_TARGET       0     1
Y_FAIL                      
0               159543  3531
1                  996  1136
Out[30]:
FAILURE_TARGET 0 1
Y_FAIL
0 0.978347 0.021653
1 0.467167 0.532833

We assumed that false negatives were more expensive than false positives. This new assumption means the best answer changes. Specifically, the number of false negatives decreased from 3982 to 3531.

4.4 Finding a cut-off when the economic costs are known

In the last example, we will assume that we have detailed costs for our business problem.

Let's assume the following:

If a false positive occurs, it costs the organization 2,500 in unnecessary repairs. If a false negative occurs, it costs the organization 2,500 in repairs and 25,000 in lost production.

In [31]:
#Define the False Positive and False Negative Weights.

cost_of_a_false_positive=2500
cost_of_a_false_negative=27500

fp_weight=(cost_of_a_false_positive)/(cost_of_a_false_positive+cost_of_a_false_negative)
fn_weight=(cost_of_a_false_negative)/(cost_of_a_false_positive+cost_of_a_false_negative)
In [32]:
#Define the False Classification Rate

tips['FALSE_CLASSIFICATIONS_W'] = np.where(((2*(((fp_weight)*tips.FALSE_POSITIVES+(fn_weight)*tips.FALSE_NEGATIVES)) >= tips.TOTAL_OBS)), tips.TOTAL_OBS, 2*(((fp_weight)*tips.FALSE_POSITIVES+(fn_weight)*tips.FALSE_NEGATIVES)))

tips['FALSE_CLASSIFICATION_RATE_W']=tips.FALSE_CLASSIFICATIONS_W/(tips.TOTAL_OBS)

tips['TOTAL_COST']=tips.FALSE_POSITIVES*cost_of_a_false_positive+tips.FALSE_NEGATIVES*cost_of_a_false_negative




gains=tips[['GROUPS','CUT-OFF','TRUE_POSITIVES','FALSE_POSITIVES','TRUE_NEGATIVES','FALSE_NEGATIVES','SENSITIVITY',
            'SPECIFICITY','FALSE_POSITIVE_RATE','FALSE_NEGATIVE_RATE','FALSE_CLASSIFICATIONS_W','FALSE_CLASSIFICATION_RATE_W','TOTAL_COST']]

gains
Out[32]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE FALSE_CLASSIFICATIONS_W FALSE_CLASSIFICATION_RATE_W TOTAL_COST
0 0 0.001404 4667 160539 0 0 1.000000 0.000000 1.000000 0.000000 26756.500000 0.161958 401347500
1 1 0.002718 4667 160522 17 0 1.000000 0.000106 0.999894 0.000000 26753.666667 0.161941 401305000
2 2 0.003173 4667 160505 34 0 1.000000 0.000212 0.999788 0.000000 26750.833333 0.161924 401262500
3 3 0.003342 4667 160489 50 0 1.000000 0.000311 0.999689 0.000000 26748.166667 0.161908 401222500
4 4 0.003520 4667 160472 67 0 1.000000 0.000417 0.999583 0.000000 26745.333333 0.161891 401180000
... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 9995 0.958857 71 12 160527 4596 0.015213 0.999925 0.000075 0.984787 8428.000000 0.051015 126420000
9996 9996 0.959056 53 12 160527 4614 0.011356 0.999925 0.000075 0.988644 8461.000000 0.051215 126915000
9997 9997 0.961913 43 7 160532 4624 0.009214 0.999956 0.000044 0.990786 8478.500000 0.051321 127177500
9998 9998 0.963536 32 1 160538 4635 0.006857 0.999994 0.000006 0.993143 8497.666667 0.051437 127465000
9999 9999 0.966488 17 0 160539 4650 0.003643 1.000000 0.000000 0.996357 8525.000000 0.051602 127875000

10000 rows × 13 columns

In [33]:
x1 = tips['CUT-OFF']
y1 = tips['FALSE_POSITIVE_RATE']
y2 = tips['FALSE_NEGATIVE_RATE']
y3 = tips['FALSE_CLASSIFICATION_RATE_W']



trace = go.Scatter(
    x = x1,
    y = y1,
    name='False Positive Rate')

trace2 = go.Scatter(
    x = x1,
    y = y2,
    name='False Negative Rate'
)

trace3 = go.Scatter(
    x = x1,
    y = y3,
    name='Weighted False Classification Rate'
)

layout = go.Layout(
    title='Weighted Mis-Classification Rates BY CUT OFF SCORE',
    xaxis=dict(
        title='CUT OFF SCORE',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='False Positive and False Negative Rates',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)
    
data=[trace,trace2,trace3]  
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines') 

Let's examine the relationship between Total Costs and the cut-off or threshold level.

In [34]:
x1 = tips['CUT-OFF']
y1 = tips['TOTAL_COST']



trace = go.Scatter(
    x = x1,
    y = y1,
    name='Total Cost')


layout = go.Layout(
    title='Cut-Off Levels and Total Cost',
    xaxis=dict(
        title='CUT OFF SCORE',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='Total Costs',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)
    
data=[trace]  
fig = go.Figure(data=data, layout=layout)

#plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines') 
In [35]:
gains=gains.sort_values(by=['TOTAL_COST'], ascending=[True])
gains=gains.head(1)
gains
Out[35]:
GROUPS CUT-OFF TRUE_POSITIVES FALSE_POSITIVES TRUE_NEGATIVES FALSE_NEGATIVES SENSITIVITY SPECIFICITY FALSE_POSITIVE_RATE FALSE_NEGATIVE_RATE FALSE_CLASSIFICATIONS_W FALSE_CLASSIFICATION_RATE_W TOTAL_COST
9135 9135 0.665541 3102 11189 149350 1565 0.664667 0.930304 0.069696 0.335333 4734.0 0.028655 71010000

Given the actual costs of a false positive and a false negative, the best cut-off is .665541.
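As a sanity check on the arithmetic, the reported TOTAL_COST for this cut-off can be reproduced directly from the false positive and false negative counts in the row above:

```python
# Costs defined in section 4.4.
cost_of_a_false_positive = 2500
cost_of_a_false_negative = 27500

# Counts from the best-cut-off row above (cut-off .665541).
false_positives = 11189
false_negatives = 1565

total_cost = (false_positives * cost_of_a_false_positive
              + false_negatives * cost_of_a_false_negative)
print(total_cost)  # 71010000
```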

In [36]:
dfx['Y_FAIL'] = np.where(((dfx.P_FAIL <= .665541)), 0, 1)
In [37]:
print(pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET, dropna=False))
pd.crosstab(dfx.Y_FAIL, dfx.FAILURE_TARGET).apply(lambda r: r/r.sum(), axis=1)
FAILURE_TARGET       0     1
Y_FAIL                      
0               149350  1565
1                11189  3102
Out[37]:
FAILURE_TARGET 0 1
Y_FAIL
0 0.98963 0.01037
1 0.78294 0.21706

Note that the number of False Positives increases substantially from our previous example.

5. Summary and conclusions

As you can see from these three scenarios, the best cut-off depends greatly on the economics of your problem. Because of this, it is important to understand the economic costs of both a false positive and a false negative. As with most things in data science, the "best" answer depends on the context. If a false negative means a bolt will fly off and potentially injure someone, you have to make sure your model predictions reflect this hazard. Data science, like all things in this world, is subject to the context in which it is applied.

Author

Shad Griffin is a Data Scientist at the IBM Global Solution Center in Dallas, Texas.


Copyright © IBM Corp. 2021. This notebook and its source code are released under the terms of the MIT License.